深度剖析PostgreSQL中的统计信息

Original xiongcc PostgreSQL学徒

2024-09-30

统计信息

前言

在之前和各位分享了一篇《PostgreSQL优化器解析》，主要讲解了优化器是如何高效工作的。对于复杂查询，优化器会基于既有的统计信息，高效选择一条最优的执行路径，然后交由Executor去执行。这一章开始，我会聊一聊执行计划的解析，由于整体篇幅较长，大概2万字，我分为两篇来发。

上半篇让我们先唠唠PostgreSQL中的统计信息。篇幅较长，涉及到很多枯燥原理和源码分析。

统计信息

统计信息，想必各位耳熟能详，我们先了解一下为什么数据库需要统计信息以及统计信息的作用，假如现在有这么一条SQL：select * from customers join orders on customers.cust_id = orders.customer_id where customers.balance < 1000 and orders.total > 10000，涉及到两表join，那么两个表在连接之前我是先扫描customers表还是orders表呢，根据常识，两表连接之前会让数据尽可能地减少，先过滤掉一部分数据以减少CPU的开销。那么这时优化器不光需要知道两表的总行数，也需要知道每个表中符合条件的行数，即查询谓词的选择率，因此我们就需要通过一种方式知道数据库里数据的一个分布情况和一个大致的行数。

另外如之前我在优化器解析里面所说的，对于一条SQL，数据库可以有多种方式去执行，条条大路通罗马，比如顺序扫描、索引扫描，多表连接的话又有nestloop、hashjoin、mergejoin等。举个通俗的栗子，我要从成都出差去北京，我可以选择坐飞机，也可以坐高铁，也可以坐轮船，甚至可以脑抽了骑车去也行，无外乎效率（money）的问题，数据库也是如此，需要有一种机制告诉它如何去选择一条最优的方式去生成执行计划，这就是统计信息的作用。

在PostgreSQL里面，统计信息存放于pg_statistics系统表中，由于pg_statistics里面的内容人为不易阅读，因此便有了pg_stats视图。

postgres=# \d pg_stats
                     View "pg_catalog.pg_stats"
         Column         |   Type   | Collation | Nullable | Default 
------------------------+----------+-----------+----------+---------
 schemaname             | name     |           |          |     ---模式名
 tablename              | name     |           |          |     ---表名
 attname                | name     |           |          |     ---列名
 inherited              | boolean  |           |          |     ---是否是继承列
 null_frac              | real     |           |          |     ---null空值的比率
 avg_width              | integer  |           |          |     ---平均宽度，字节
 n_distinct             | real     |           |          |     ---大于零就是非重复值的数量，小于零则是非重复值的个数除以行数
 most_common_vals       | anyarray |           |          |     ---高频值
 most_common_freqs      | real[]   |           |          |     ---高频值的频率
 histogram_bounds       | anyarray |           |          |     ---直方图
 correlation            | real     |           |          |     ---物理顺序和逻辑顺序的关联性
 most_common_elems      | anyarray |           |          |     ---高频元素，比如数组
 most_common_elem_freqs | real[]   |           |          |     ---高频元素的频率
 elem_count_histogram   | real[]   |           |          |     ---直方图（元素）

随便新建一个表，顺序插入10W行数据

postgres=# create table test(id int);
CREATE TABLE
postgres=# insert into test values(generate_series(1,100000));
INSERT 0 100000
postgres=# analyze test;
ANALYZE
postgres=# select * from pg_stats where tablename = 'test';
-[ RECORD 1 ]----------+---------------------------------------------------------------------------------------------------------
schemaname             | public
tablename              | test
attname                | id
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | -1
most_common_vals       | 
most_common_freqs      | 
histogram_bounds       | {6,1052,2075,3083,4085,5095,6126,7088,8101,9182,10199,11174,12095,13044,14014,14993,16032,17034,17997,19015,19983,21025,22016,22980,23904,24911,25982,27022,28024,29089,30111,31122,32116,33082,34112,35041,36012,37099,38090,39118,40052,41041,42046,43020,43991,44920,45932,46966,47952,48917,49938,50929,51857,52904,53872,54933,55906,56876,57882,58940,59951,60951,62015,63016,64128,65096,66091,66974,67983,68951,69920,70944,71949,72998,73900,74885,75844,76797,77885,78869,79833,80834,81808,82897,83917,84933,85913,86997,87999,89039,89943,90902,91885,92827,93849,94874,95911,96965,97998,98907,99998}
correlation            | 1
most_common_elems      | 
most_common_elem_freqs | 
elem_count_histogram   |

现在我们来分析一下每个字段的作用

n_distinct是-1，表示在某一列中的非重复值与总行数相同，用负数形式是因为analyze认为非重复值的数目是随着表增长而增长的，用正数表示字段看上去好像有固定数量的可能值
null_frac为0，表示没有null值，注意null并不会实际插入到数据中，而是以TupleHeader中的位图来表示的，t_bits字段，一个字段一个比特位，同时结合infomask中的HEAP_HASNULL标志位进行标识
most_common_vals：MCV，高频值列表，此例因为是顺序插入的，数据分布均匀，所以没有高频值
most_common_freqs：MCF，高频值的频率
histogram_bounds：直方图，表示该字段除了高频值以外值的的柱状图信息，直方图中的数据不包含MCV/MCF的部分，两者的值是补充关系而且不会重合，但不一定互补（两种加起来未必是全部数据），用于将对应列的值划分为多个分组。
“
If the MCV cannot be used, the value of the histogram_bounds of the target column is used to estimate the cost. histogram_bounds is a list of values that divide the column's values into groups of approximately equal population.
correlation：值的物理顺序和逻辑顺序的关联度，-1到1之间，1表示逻辑顺序与存储的物理顺序相同，-1表示逻辑顺序与存储的物理顺序相反

correlation的作用

那correlation有什么作用呢？这一点可以参照 internal db的样例，假如现在有一个表如下

testdb=# \d tbl_corr
    Table "public.tbl_corr"
  Column  |  Type   | Modifiers 
----------+---------+-----------
 col      | text    | 
 col_asc  | integer | 
 col_desc | integer | 
 col_rand | integer | 
 data     | text    |
Indexes:
    "tbl_corr_asc_idx" btree (col_asc)
    "tbl_corr_desc_idx" btree (col_desc)
    "tbl_corr_rand_idx" btree (col_rand)
    
testdb=# SELECT col,col_asc,col_desc,col_rand FROM tbl_corr;
   col    | col_asc | col_desc | col_rand 
----------+---------+----------+----------
 Tuple_1  |       1 |       12 |        3
 Tuple_2  |       2 |       11 |        8
 Tuple_3  |       3 |       10 |        5
 Tuple_4  |       4 |        9 |        9
 Tuple_5  |       5 |        8 |        7
 Tuple_6  |       6 |        7 |        2
 Tuple_7  |       7 |        6 |       10
 Tuple_8  |       8 |        5 |       11
 Tuple_9  |       9 |        4 |        4
 Tuple_10 |      10 |        3 |        1
 Tuple_11 |      11 |        2 |       12
 Tuple_12 |      12 |        1 |        6
(12 rows)

testdb=# SELECT tablename,attname, correlation FROM pg_stats WHERE tablename = 'tbl_corr';
 tablename | attname  | correlation 
-----------+----------+-------------
 tbl_corr  | col_asc  |           1
 tbl_corr  | col_desc |          -1
 tbl_corr  | col_rand |    0.125874
(3 rows)

表上有三个索引，其中

col_asc索引是顺序递增的
col_desc索引是倒序递减的
col_rand索引则是随机分布的

因此假如有个SQL需要查询2到4的数据 SELECT * FROM tbl_corr WHERE col_asc BETWEEN 2 AND 4;，对于col_asc，只需要读第一个页面即可

而假如是对随机列进行查询的话，SELECT * FROM tbl_corr WHERE col_rand BETWEEN 2 AND 4;则需要读取所有的页面

因此在索引扫描的时候也会将correlation考虑进去 ‘table IO cost’=max_IO_cost+indexCorrelation2×(min_IO_cost−max_IO_cost).，更多细节请参考 https://www.interdb.jp/pg/pgsql03.html。

“
This way, the index correlation is a statistical correlation that reflects the influence of random access caused by the twist between the index ordering and the physical tuple ordering in the table in estimating the index scan cost.

在PostgreSQL中，我们可以使用cluster命令进行聚簇，对于某些时序类的数据进行范围查询，会有性能提升。

postgres=# create table test(id int);
CREATE TABLE
postgres=# insert into test SELECT ceil(random() * 10) AS num FROM generate_series(1,10);
INSERT 0 10
postgres=# select * from test;
 id 
----
  7
 10
  5
  2
  8
  2
  5
  5
  7
  1
(10 rows)
postgres=# analyze test;
ANALYZE
postgres=# select correlation from pg_stats where tablename  = 'test';
 correlation 
-------------
 -0.33333334
(1 row)
postgres=# create index on test(id);
CREATE INDEX
postgres=# cluster test USING test_id_idx ;
CLUSTER
postgres=# analyze test;
ANALYZE
postgres=# select correlation from pg_stats where tablename  = 'test';
 correlation 
-------------
           1
(1 row)

不过这种是一次性的，只对存量数据有效，增量数据无法保证，并且是8级锁，因此要注意时间窗口。

回到直方图，不难想象，直方图的分组越多，统计信息也就越精细准确。我们可以合理调整某些表的收集阈值，但代表就是优化器规划时间（planning time）会变久，统计信息收集的开销也会变大。

postgres=# alter table test alter column id set statistics 200;
ALTER TABLE

这样就会收集200个bucket了。在Internal DB的Query Processing章节有个例子

testdb=# SELECT histogram_bounds FROM pg_stats WHERE tablename = 'tbl' AND attname = 'data';
                       histogram_bounds
---------------------------------------------------------------------------------------------------
 {1,100,200,300,400,500,600,700,800,900,1000,1100,1200,1300,1400,1500,1600,1700,1800,1900,2000,2100,
2200,2300,2400,2500,2600,2700,2800,2900,3000,3100,3200,3300,3400,3500,3600,3700,3800,3900,4000,4100,
4200,4300,4400,4500,4600,4700,4800,4900,5000,5100,5200,5300,5400,5500,5600,5700,5800,5900,6000,6100,
6200,6300,6400,6500,6600,6700,6800,6900,7000,7100,7200,7300,7400,7500,7600,7700,7800,7900,8000,8100,
8200,8300,8400,8500,8600,8700,8800,8900,9000,9100,9200,9300,9400,9500,9600,9700,9800,9900,10000}
(1 row)

默认情况下会采集100个bucket（由default_statistics_target参数控制），以上面这个例子为例，bucket[0].min是1，bucket[0].max是100，以此类推。

Fig. 3.7. Buckets and histogram_bounds.

直方图的分类

直方图的基本原理是将数据排序后分成若干个桶（bucket），并记录每个桶中数据的最大值、最小值、出现频次占比等信息。按照数据分桶的方式和桶中储存的数据可以大致分为如下几类（常见的）

Equi-width Histogram
Equi-height Histogram
Singleton Histogram
Compressed Histogram
End-biased histogram
Serial Histogram

Equi-width Histogram

Equi-width Histogram，即等宽直方图，顾名思义，等宽直方图将数据最大、小值之间的区间等分为N份，每个桶中最大、小值之差都为整体数据最大、小值之差/N，即所谓的"等宽"。

举个栗子，假设某一列各个值的分布如下

划分为4个桶的话，那么等宽直方图长这个样子 👇🏻

简洁明了，但是 Equi-width Histogram 最大的缺陷是在数据频次较高的桶中统计信息不够清晰，比如在第二个桶中，总频次是19，但是我们并不知道是实际情况那样（5出现了1次，6出现了6次，7出现了4次，8出现了8次），还是5出现了16次，6、7、8各出现1次。

因此，当桶的数量远小于列中 distinct value 数量、单个桶中 distinct value 过多且分布不均时，Equi-width Histogram 很有可能做出错误的估算并影响优化结果。当然好处就是易于理解和收集了。

Singleton Histogram

Singleton Histogram 可以视为等宽直方图在桶数量与 distinct value 数目相等时的特殊情况，每个桶都只记录一个值的出现频次。Singleton Histogram 能够提供最为精确的数据分布信息，但当列中的 distinct value 较多时，Singleton Histogram 占用的内存也会到达一个难以接受的数值。

Equi-height Histogram

Equi-height Histogram，即等高直方图，也可以称作为 Equi-depth Histogram，不同于等宽直方图，等高直方图会保证每个桶中数值的频次之和为总行数的 1/N 。

还是以前面那个为例，假如是等高直方图的话，划分为4个桶，那么分布情况如下👇🏻

可以看到，数据集中的区间内每个桶中的数值跨度更小，这样有利于增加选择率估算的准确性；而数据分散的区间内每个桶中的数值跨度更大，有利于减小储存直方图所消耗的内存。

但是等高直方图也有自己的缺点，当存在某个频次占比远高于总行数/N 的数值时，等桶中的数据分布将出现较大倾斜，会导致其他桶相对较宽，仅当数据分布偏差不大时才适用于范围查询。

Compressed Histogram

前面提到了，对于 Equi-height Hisgotram 和 Singleton Histogram 都比较尴尬的分布方式：少数数值出现频率极高，而大多数数值则占比不大且分布较为均匀。对于这样的数据，使用 Equi-height Hisgotram 会遇到高占比数值引起的统计数据倾斜，而使用 Singleton Histogram 则需要对占比不大的数据也单独建桶，颇有大材小用之嫌。

对于这样的数据，就适合建立 Compressed Histogram。它是Equi-height Hisgotram 和 Singleton Histogram 的结合，既会分配单独的桶给出现频次较高、对选择率影响较大的数值，也会建立类似于等高直方图的桶来聚合频次较低、没有必要单独建桶的数值，同时集合了 Singleton Histogram 的高精度和 Equi-height Hisgotram 的高聚合度。

在PostgreSQL和PolarDB中，便使用了 Compressed Histogram。在 pg_statistic.h 源码中，有这一块的简述：

“
thus, it's a "compressed histogram" in the technical parlance

/*
 * A "histogram" slot describes the distribution of scalar data.  staop is
 * the OID of the "<" operator that describes the sort ordering, and stacoll
 * is the relevant collation.  (In theory more than one histogram could appear,
 * if a datatype has more than one useful sort operator or we care about more
 * than one collation.  Currently the collation will always be that of the
 * underlying column.)  stavalues contains M (>=2) non-null values that
 * divide the non-null column data values into M-1 bins of approximately equal
 * population.  The first stavalues item is the MIN and the last is the MAX.
 * stanumbers is not used and should be NULL.  IMPORTANT POINT: if an MCV
 * slot is also provided, then the histogram describes the data distribution
 * *after removing the values listed in MCV* (thus, it's a "compressed
 * histogram" in the technical parlance).  This allows a more accurate
 * representation of the distribution of a column with some very-common
 * values.  In a column with only a few distinct values, it's possible that
 * the MCV list describes the entire data population; in this case the
 * histogram reduces to empty and should be omitted.
 */


#define STATISTIC_KIND_HISTOGRAM  2

“
一个 "直方图 "槽描述了标量数据的分布。staop是描述排序顺序的"<"运算符的OID，而stacoll是相关的排序方式。(理论上，如果一个数据类型有多个有用的排序操作符，或者我们关心多个排序，就可以出现多个直方图。stavalues包含M（>=2）个非空值，将非空列的数据值分为M-1个数量大致相同的仓库。第一个 stavalues 项是 MIN，最后一个是 MAX。stanumbers 不使用，应该是 NULL。重要提示：如果还提供了MCV，那么直方图描述的是去除MCV中所列数值后的数据分布（因此，用技术术语来说是 "压缩直方图"）。这可以更准确地表示具有一些非常常见的值的列的分布。在一个只有几个不同数值的列中，有可能MCV列表就描述了整个数据分布；在这种情况下，直方图就会变成空的，应该被省略掉。

现在回过头来看pg_stats里面的内容也能窥见一二，n_distinct用于单独记录非重复值的比例，histogram_bounds和MCV/MCF互补用于评估值的分布情况。

n_distinct是-1，表示在某一列中的非重复值与总行数相同，用负数形式是因为analyze认为非重复值的数目是随着表增长而增长的，用正数表示字段看上去好像有固定数量的可能值
null_frac为0，表示没有null值，注意null并不会实际插入到数据中，而是以TupleHeader中的位图来表示的，t_bits字段，一个字段一个比特位，同时结合infomask中的HEAP_HASNULL标志位进行标识
most_common_vals：MCV，高频值列表，此例因为是顺序插入的，数据分布均匀，所以没有高频值
most_common_freqs：MCF，高频值的频率
histogram_bounds：直方图，表示该字段除了高频值以外值的的柱状图信息，直方图中的数据不包含MCV/MCF的部分，两者的值是补充关系而且不会重合，但不一定互补（两种加起来未必是全部数据），用于将对应列的值划分为多个分组。

Serial Histogram

采集算法

前面讲到了直方图的分类，必然还要涉及到直方图中的数据采集算法。

各个数据库之间有所差异，查阅了一下资料：

MariaDB很暴力，采用全表扫的方式，优势就是相对精细准确，当然代价也十分昂贵
PostgreSQL采用的是采样，选取部分数据进行获取
MySQL从 8.0之后开始有了直方图，查了一下资料，可以选择全表扫也可以选择抽样，由histogram_generation_max_mem_size参数控制

当然可能更高效的方式是类似增量检查点的方式，增量更新统计信息，然后搭配数据库的动态优化能力，动态根据上一步的实际执行统计信息（每个node结果集的柱状图、高频词、记录数等）动态调整下一步的执行计划，对于复杂Join尤其有效。

统计信息的采集流程

统计信息的收集流程逻辑在analyze.c中，简单分析一下源码逻辑

获取锁

 /*
  * Open the relation, getting ShareUpdateExclusiveLock to ensure that two
  * ANALYZEs don't run on it concurrently.  (This also locks out a
  * concurrent VACUUM, which doesn't matter much at the moment but might
  * matter if we ever try to accumulate stats on dead tuples.) If the rel
  * has been dropped since we last saw it, we don't need to process it.
  *
  * Make sure to generate only logs for ANALYZE in this case.
  */
 onerel = vacuum_open_relation(relid, relation, params->options & ~(VACOPT_VACUUM),
          params->log_min_duration >= 0,
          ShareUpdateExclusiveLock);

第一步需要获取目标对象上的锁，因为analyze命令可以

一把梭分析整个数据库
分析指定的某几个表
分析指定表的某几个列

明确要分析哪些表之后便需要加锁，和vacuum一样使用4级锁ShareUpdateExclusive，因此不允许并发analyze，另外在analyze的过程中也会阻塞vacuum（虽说理论上二者可以不冲突，但是注释写了会干扰死元祖的统计）。

校验权限

 /*
  * Silently ignore tables that are temp tables of other backends ---
  * trying to analyze these is rather pointless, since their contents are
  * probably not up-to-date on disk.  (We don't throw a warning here; it
  * would just lead to chatter during a database-wide ANALYZE.)
  */
 if (RELATION_IS_OTHER_TEMP(onerel))
 {
  relation_close(onerel, ShareUpdateExclusiveLock);
  return;
 }

这一步顾名思义，只有表的owner、数据库的owner或者超级用户才可以执行analyze。

postgres=> analyze u1_t ;
WARNING:  skipping "u1_t" --- only table or database owner can analyze it
ANALYZE

并且也不能收集其他会话创建的临时表，不过现象是不会发出任何告警，只是悄悄地略过。

---非创建临时表的会话
postgres=# analyze verbose pg_temp_4.temp_test ;
ANALYZE

---创建临时表的会话
postgres=# analyze verbose pg_temp_4.test_tmp ;
INFO:  analyzing "pg_temp_4.test_tmp"
INFO:  "test_tmp": scanned 443 of 443 pages, containing 100000 live rows and 0 dead rows; 30000 rows in sample, 100000 estimated total rows
ANALYZE

确定采样函数

接着确定如何进行采样，如果是普通表或者物化视图，则采样函数采用acquire_sample_rows；如果是外部表，需要使用外部表提供的hook，如postgres_fdw的postgresAnalyzeForeignTable，假如不支持就会跳过。

 /*
  * Check that it's of an analyzable relkind, and set up appropriately.
  */
 if (onerel->rd_rel->relkind == RELKIND_RELATION ||
  onerel->rd_rel->relkind == RELKIND_MATVIEW)
 {
  /* Regular table, so we'll use the regular row acquisition function */
  acquirefunc = acquire_sample_rows;       ---函数指针
  /* Also get regular table's size */
  relpages = RelationGetNumberOfBlocks(onerel);  ---获取表的大小
 }
 else if (onerel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
 {
  /*
   * For a foreign table, call the FDW's hook function to see whether it
   * supports analysis.
   */
  FdwRoutine *fdwroutine;
  bool  ok = false;

在 acquire_sample_rows() 里面可以看到，采样也支持预取，由 effective_io_concurrency 参数指定，使用标准 posix_fadvice()，采集流程在后文详细分析。

#ifdef USE_PREFETCH

 /*
  * If we are doing prefetching, then go ahead and tell the kernel about
  * the first set of pages we are going to want.  This also moves our
  * iterator out ahead of the main one being used, where we will keep it so
  * that we're always pre-fetching out prefetch_maximum number of blocks
  * ahead.
  */
 if (prefetch_maximum)
 {
  for (int i = 0; i < prefetch_maximum; i++)
  {
   BlockNumber prefetch_block;

   if (!BlockSampler_HasMore(&prefetch_bs))
    break;

   prefetch_block = BlockSampler_Next(&prefetch_bs);
   PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
  }
 }
#endif

收集统计信息

确定好了采样函数之后，接着进行初始化，将待分析的表传入到 do_analyze_rel() 中，开始收集统计信息。在此之前，需要进一步确定要分析一个表中的哪些列：用户可能指定只分析表中的某几个列，被频繁访问的列才更有被分析的价值，然后还要打开待分析表的所有索引，看看是否有可以被分析的列。

 /*
  * Determine which columns to analyze
  *
  * Note that system attributes are never analyzed, so we just reject them
  * at the lookup stage.  We also reject duplicate column mentions.  (We
  * could alternatively ignore duplicates, but analyzing a column twice
  * won't work; we'd end up making a conflicting update in pg_statistic.)
  */
 if (va_cols != NIL)
 {
  Bitmapset  *unique_cols = NULL;
  ListCell   *le;

  vacattrstats = (VacAttrStats **) palloc(list_length(va_cols) *
            sizeof(VacAttrStats *));
  tcnt = 0;
  foreach(le, va_cols)
  {
   char    *col = strVal(lfirst(le));

   i = attnameAttNum(onerel, col, false);
   if (i == InvalidAttrNumber)
    ereport(ERROR,
      (errcode(ERRCODE_UNDEFINED_COLUMN),
       errmsg("column \"%s\" of relation \"%s\" does not exist",
        col, RelationGetRelationName(onerel))));
   if (bms_is_member(i, unique_cols))
    ereport(ERROR,
      (errcode(ERRCODE_DUPLICATE_COLUMN),
       errmsg("column \"%s\" of relation \"%s\" appears more than once",
        col, RelationGetRelationName(onerel))));
   unique_cols = bms_add_member(unique_cols, i);

   vacattrstats[tcnt] = examine_attribute(onerel, i, NULL); ---逻辑在此处
   if (vacattrstats[tcnt] != NULL)
    tcnt++;
  }
  attr_cnt = tcnt;
 }

在真正开始分析之前，需要先检查每个字段，并返回VacAttrStats结构体。后面所有的分析都将在此检查之上进行。

具体逻辑在 examine_attribute()里面

/*
 * examine_attribute -- pre-analysis of a single column
 *
 * Determine whether the column is analyzable; if so, create and initialize
 * a VacAttrStats struct for it.  If not, return NULL.
 *
 * If index_expr isn't NULL, then we're trying to analyze an expression index,
 * and index_expr is the expression tree representing the column's data.
 */
static VacAttrStats *
examine_attribute(Relation onerel, int attnum, Node *index_expr)
{
 Form_pg_attribute attr = TupleDescAttr(onerel->rd_att, attnum - 1);
 HeapTuple typtuple;
 VacAttrStats *stats;
 int   i;
 bool  ok;

 /* Never analyze dropped columns */   ---列如果被删除
 if (attr->attisdropped)
  return NULL;

 /* Don't analyze column if user has specified not to */   ---指定了列
 if (attr->attstattarget == 0)
  return NULL;
  
...

确定这个字段是否可以分析，如果不可以，则返回NULL。两种情况致使这个字段不进行分析：字段已被删除（已删除的字段还存在于系统表中，只是作了标记）；用户指定了字段。

获取数据类型，并决定针对该类型的采样数据量和统计函数不同的类型，其分析函数也不同，对于数组使用array_typanalyze。如果该类型没有对应的分析函数，则采用标准的分析函数std_typanalyze。

 /*
  * Call the type-specific typanalyze function.  If none is specified, use
  * std_typanalyze().
  */
 if (OidIsValid(stats->attrtype->typanalyze))  ---数组类型
  ok = DatumGetBool(OidFunctionCall1(stats->attrtype->typanalyze,
             PointerGetDatum(stats)));
 else
  ok = std_typanalyze(stats);   ---普通类型或者其他类型

 /*
  * Custom ANALYZE procedure for the datatype (0 selects the default).
  */
 regproc  typanalyze BKI_DEFAULT(-) BKI_ARRAY_DEFAULT(array_typanalyze) BKI_LOOKUP_OPT(pg_proc);

看下标准函数std_typanalyze

/*
 * std_typanalyze -- the default type-specific typanalyze function
 */
bool
std_typanalyze(VacAttrStats *stats)
{
 Form_pg_attribute attr = stats->attr;
 Oid   ltopr;
 Oid   eqopr;
 StdAnalyzeData *mystats;

 /* If the attstattarget column is negative, use the default value */
 /* NB: it is okay to scribble on stats->attr since it's a copy */
 if (attr->attstattarget < 0)
  attr->attstattarget = default_statistics_target;  ---默认采集的桶数量

 /* Look for default "<" and "=" operators for column's type */
 get_sort_group_operators(stats->attrtypid,
        false, false, false,
        &ltopr, &eqopr, NULL,
        NULL);

 /* Save the operator info for compute_stats routines */
 mystats = (StdAnalyzeData *) palloc(sizeof(StdAnalyzeData));
 mystats->eqopr = eqopr;
 mystats->eqfunc = OidIsValid(eqopr) ? get_opcode(eqopr) : InvalidOid;
 mystats->ltopr = ltopr;
 stats->extra_data = mystats;

 /*
  * Determine which standard statistics algorithm to use
  */
 if (OidIsValid(eqopr) && OidIsValid(ltopr))
 {
  /* Seems to be a scalar datatype */
  stats->compute_stats = compute_scalar_stats;
  /*--------------------
   * The following choice of minrows is based on the paper
   * "Random sampling for histogram construction: how much is enough?"
   * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
   * Proceedings of ACM SIGMOD International Conference on Management
   * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
   * says that for table size n, histogram size k, maximum relative
   * error in bin size f, and error probability gamma, the minimum
   * random sample size is
   *  r = 4 * k * ln(2*n/gamma) / f^2
   * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
   *  r = 305.82 * k
   * Note that because of the log function, the dependence on n is
   * quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
   * bin size error with probability 0.99.  So there's no real need to
   * scale for n, which is a good thing because we don't necessarily
   * know it at this point.
   *--------------------
   */
  stats->minrows = 300 * attr->attstattarget;
 }
 else if (OidIsValid(eqopr))
 {
  /* We can still recognize distinct values */
  stats->compute_stats = compute_distinct_stats;
  /* Might as well use the same minrows as above */
  stats->minrows = 300 * attr->attstattarget;
 }
 else
 {
  /* Can't do much but the trivial stuff */
  stats->compute_stats = compute_trivial_stats;
  /* Might as well use the same minrows as above */
  stats->minrows = 300 * attr->attstattarget;
 }

 return true;
}

这里是重点。单独分析一下

采集函数

可以看到，默认采集的桶树由default_statistics_target参数控制，然后获取列的操作符类型，对于不同类型的列，compute_stats 函数指针指向的计算函数不太一样。可以看到分为三类

/* Return results as needed */
 if (ltOpr)
  *ltOpr = lt_opr;   ---less than，小于
 if (eqOpr)
  *eqOpr = eq_opr;   ---equal，等于
 if (gtOpr)
  *gtOpr = gt_opr;   ---greater than，大于
 if (isHashable)
  *isHashable = hashable;
}

compute_scalar_stats，如果说一个列的数据类型支持默认的 =（eqopr：equals operator）和 <（ltopr：less than operator），那么这个列应该是一个数值类型，可以使用 compute_scalar_stats() 函数进行分析
compute_distinct_stats，如果列的数据类型只支持 = 运算符，那么依旧还可以使用 compute_distinct_stats 进行唯一值的统计分析
compute_trivial_stats，如果都不行，那么这个列只能使用 compute_trivial_stats 进行一些简单的分析

划重点，可以看到，不同的数据类型，采样的数量都是stats->minrows = 300 * attr->attstattarget

因此，对于默认值default_statistics_target，那么采样的数量就是 300 * 100 = 30000行，没错，3万行。接下来用于确定最小的阈值，内核代码分配了一块相应长度的元组数组，然后开始使用 acquirefunc 函数指针指向的acquire_sample_rows()进行采样，可以看到，最小采集100行

 /*
  * Determine how many rows we need to sample, using the worst case from
  * all analyzable columns.  We use a lower bound of 100 rows to avoid
  * possible overflow in Vitter's algorithm.  (Note: that will also be the
  * target in the corner case where there are no analyzable columns.)
  */
 targrows = 100;
 for (i = 0; i < attr_cnt; i++)
 {
  if (targrows < vacattrstats[i]->minrows)
   targrows = vacattrstats[i]->minrows;
 }
 for (ind = 0; ind < nindexes; ind++)
 {
  AnlIndexData *thisdata = &indexdata[ind];

  for (i = 0; i < thisdata->attr_cnt; i++)
  {
   if (targrows < thisdata->vacattrstats[i]->minrows)
    targrows = thisdata->vacattrstats[i]->minrows;
  }
 }

接着，就开始正式采集了，acquirefunc是个函数指针，前面提过，对于非继承列，使用acquire_sample_rows()

 /*
  * Acquire the sample rows
  */
 rows = (HeapTuple *) palloc(targrows * sizeof(HeapTuple));
 pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE,
         inh ? PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS_INH :
         PROGRESS_ANALYZE_PHASE_ACQUIRE_SAMPLE_ROWS);
 if (inh)
  numrows = acquire_inherited_sample_rows(onerel, elevel,
            rows, targrows,
            &totalrows, &totaldeadrows);
 else
  numrows = (*acquirefunc) (onerel, elevel,
          rows, targrows,
          &totalrows, &totaldeadrows);

正式采样

acquire_sample_rows采用的两阶段算法，简单看一下注释即可

/*
 * acquire_sample_rows -- acquire a random sample of rows from the table
 *
 * Selected rows are returned in the caller-allocated array rows[], which
 * must have at least targrows entries.
 * The actual number of rows selected is returned as the function result.
 * We also estimate the total numbers of live and dead rows in the table,
 * and return them into *totalrows and *totaldeadrows, respectively.
 *
 * The returned list of tuples is in order by physical position in the table.
 * (We will rely on this later to derive correlation estimates.)
 *
 *
 // 自 2004 年 5 月起，我们使用了一种新的两阶段方法：第一阶段选择最多 targrows 个随机块
 //（或所有块，如果没有那么多）。第二阶段扫描这些块并使用Vitter算法创建一个随机样本的 targrows 
 // 行（或更少，如果块样本中的样本较少）。这两个阶段同时执行：每个块在第一阶段返回其编号时立即处
 // 理，并且在读取行时，第二阶段控制哪些要插入到样本中。

 * As of May 2004 we use a new two-stage method:  Stage one selects up
 * to targrows random blocks (or all blocks, if there aren't so many).
 * Stage two scans these blocks and uses the Vitter algorithm to create
 * a random sample of targrows rows (or less, if there are less in the
 * sample of blocks).  The two stages are executed simultaneously: each
 * block is processed as soon as stage one returns its number and while
 * the rows are read stage two controls which ones are to be inserted
 * into the sample.
 *
 
 // 尽管每一行进入最终样本的机会均等，但这种抽样方法并不完美：并非每个可能的样本都有均等的机会被
 // 选中。对于大表，样本所代表的不同块的数量往往太少。我们现在可以忍受。欢迎改进。
 * Although every row has an equal chance of ending up in the final
 * sample, this sampling method is not perfect: not every possible
 * sample has an equal chance of being selected.  For large relations
 * the number of different blocks represented by the sample tends to be
 * too small.  We can live with that for now.  Improvements are welcome.
 *
 * An important property of this sampling method is that because we do
 * look at a statistically unbiased set of blocks, we should get
 * unbiased estimates of the average numbers of live and dead rows per
 * block.  The previous sampling method put too much credence in the row
 * density near the start of the table.
 */
static int
acquire_sample_rows(Relation onerel, int elevel,
                    HeapTuple *rows, int targrows,
                    double *totalrows, double *totaldeadrows)
{
    // ...

    /* Outer loop over blocks to sample */
    while (BlockSampler_HasMore(&bs))
    {
        bool        block_accepted;
        BlockNumber targblock = BlockSampler_Next(&bs);
        // ...
    }

    // ...

    /*
     * If we didn't find as many tuples as we wanted then we're done. No sort
     * is needed, since they're already in order.
     *
     * Otherwise we need to sort the collected tuples by position
     * (itempointer). It's not worth worrying about corner cases where the
     * tuples are already sorted.
     */
    if (numrows == targrows)
        qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);

    /*
     * Estimate total numbers of live and dead rows in relation, extrapolating
     * on the assumption that the average tuple density in pages we didn't
     * scan is the same as in the pages we did scan.  Since what we scanned is
     * a random sample of the pages in the relation, this should be a good
     * assumption.
     */
    if (bs.m > 0)
    {
        *totalrows = floor((liverows / bs.m) * totalblocks + 0.5);
        *totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
    }
    else
    {
        *totalrows = 0.0;
        *totaldeadrows = 0.0;
    }

    // ...
}

总共分为两个阶段进行采集：

phase 1：随机选择包含目标采样行数的数据块
phase 2：对每一个数据块使用 Vitter 算法按行随机采样数据

两阶段同时进行，并且可以看到采样过程也会受到vacuum_cost_delay的影响。在采样完成后，被采样到的元组应该已经被放置在元组数组中了。对这个元组数组按照元组的位置进行快速排序，并使用这些采样到的数据估算整个表中的存活元组与死元组的个数。

统计数据

这里参照一下《PostgreSQL · 特性分析 · 统计信息计算方法》

在获取到相应样本数据后，针对每个字段分别进行分析。首先会依据当前字段的值，对记录进行排序。因在取出样本数据的时候，按照tuple在磁盘中的位置顺序取出的，因此对值进行排序后即可计算得出相关性。另外，在排序后，也更容易计算统计值的频率，从而得出MCV和MCF。这里采用的快速排序！

之后，会根据每个值进行分析：

如果是NULL，则计数 NULL概率的计算公式是：stanullfrac = null_number / sample_row_number。
如果是变长字段，如text等，则需要计算平均宽度
计算出现最多的值，和相应频率
基数的计算，该部分计算稍微复杂一些，分为以下三种情况：

当采样中的值没有重复的时候，则认为所有的值唯一，stadistinct = -1。
当采样中的每个值都出现重复的时候，则认为基数有限，则stadistinct = distinct_value_number
当采样中的值中，存在有唯一值并且存在不唯一值的时候，则依据以下的公式（by Haas and Stokes in IBM Research Report RJ 10025）：n * d / (n - f1 + f1 * n/N)，其中，N是指所有的记录数，即pg_class.reltuples；n是指sample_row_number，即采样的记录数；f1则是只出现一次的值的数据；d则是采样中所有的值的数量。

MCV / MCF 并不是所有采样的值都会被列入MCV/MCF。首先是如果可以，则将所有采样的记录放到MCV中，如表所有的记录都已经取作采样的时候；其次，则是选取那些出现频率超过平均值的值，事实上是超过平均值的25%；那些出现频率大于直方图的个数的倒数的时候等。

直方图计算直方图，会首先排除掉MCV中的值。意思是直方图中的数据不包含MCV/MCF的部分，两者的值是补充关系而且不会重合，但不一定互补（两种加起来未必是全部数据）。这个也与成本的计算方式有关系，请参考row-estimation-examples 。其计算公式相对比较简单，如下：values[(i * (nvals - 1)) / (num_hist - 1)]，i指直方图中的第几列；nvals指当前还有多少个值；num_hist则指直方图中还有多少列。计算完成后，kind的值会被置为2。到此，采样的统计基本结束。

完成采样的计算后，通过內部函数更新相关的统计信息到pg_statistic，更新relpages和totale rows到pg_class中。即完成了一次统计信息的收集。

小结

小结一下，可以看到默认情况下，采集的行数会有个限制，最小100行，最大 300 * default_statistics_target ，不足100行的话，则获取全部数据，随机获取数据块，随机获取数据，假如数据量足够的话，默认随机选择 300 * default_statistics_target 个数据块，然后从中随机选择 300 * default_statistics_target 行数据。

postgres=# insert into t1 values(generate_series(1,99));   ---99行数据
INSERT 0 99
postgres=# insert into t2 values(generate_series(1,101));   ---101行数据
INSERT 0 101
postgres=# insert into t3 values(generate_series(1,30001));  ---30001行数据
INSERT 0 30001
postgres=# create index t3_idx on t3(id);
CREATE INDEX
postgres=# analyze verbose t1;
INFO:  analyzing "public.t1"
INFO:  "t1": scanned 1 of 1 pages, containing 99 live rows and 0 dead rows; 99 rows in sample, 99 estimated total rows
ANALYZE
postgres=# analyze verbose t2; 
INFO:  analyzing "public.t2"
INFO:  "t2": scanned 1 of 1 pages, containing 101 live rows and 0 dead rows; 101 rows in sample, 101 estimated total rows
ANALYZE
postgres=# analyze verbose t3; ---统计了3W行
INFO:  analyzing "public.t3"
INFO:  "t3": scanned 133 of 133 pages, containing 30001 live rows and 0 dead rows; 30000 rows in sample, 30001 estimated total rows
ANALYZE

总结

以上便是PostgreSQL中关于统计信息的种种了。后面部分我们来分析一下执行计划是如何根据统计信息生成的，以及如何阅读评估执行计划。

参考

https://www.interdb.jp/pg/pgsql03.html

PostgreSQL 内核 ANALYZE 背后的事

PostgreSQL · 特性分析 · 统计信息计算方法

MySQL · 内核特性 · 直方图

Histograms in MariaDB, MySQL and PostgreSQL

继续滑动看下一个

PostgreSQL学徒

向上滑动看下一个

【资源分享】【收集不易多多分享】【2024年12月29日】宝山区区面试流程和结构化真题20道其中5题幼儿（独家请老师给参考答案）

《鱿鱼游戏2》今天下午四点开播，网友无心上班了，导演悄悄剧透

人民日报征集“中美友好合作故事”，令人感奋

女性最佳“绝经期”，不是45岁，而是这个数，越接近越健康！

刘恺威近况曝光，父亲刘丹证实已分手，目前失业在家，没有资源

深度剖析PostgreSQL中的统计信息

统计信息

前言

统计信息

correlation的作用

直方图的分类

Equi-width Histogram

Singleton Histogram

Equi-height Histogram

Compressed Histogram

Serial Histogram

采集算法

统计信息的采集流程

获取锁

校验权限

确定采样函数

收集统计信息

采集函数

正式采样

统计数据

小结

总结

参考

您可能也对以下帖子感兴趣

【资源分享】【收集不易多多分享】【2024年12月29日】宝山区区面试流程和结构化真题20道其中5题幼儿（独家请老师给参考答案）

《鱿鱼游戏2》今天下午四点开播，网友无心上班了，导演悄悄剧透

人民日报征集“中美友好合作故事”，令人感奋

女性最佳“绝经期”，不是45岁，而是这个数，越接近越健康！

刘恺威近况曝光，父亲刘丹证实已分手，目前失业在家，没有资源

生成图片，分享到微信朋友圈

深度剖析PostgreSQL中的统计信息

统计信息

前言

统计信息

correlation的作用

直方图的分类

Equi-width Histogram

Singleton Histogram

Equi-height Histogram

Compressed Histogram

Serial Histogram

采集算法

统计信息的采集流程

获取锁

校验权限

确定采样函数

收集统计信息

采集函数

正式采样

统计数据

小结

总结

参考

您可能也对以下帖子感兴趣